A Plot is Worth a Thousand Tests: Assessing Residual Diagnostics with the Lineup Protocol

ASC & OZCOTS 2023

Weihao (Patrick) Li

Monash University, Australia

✍️Co-authors

Professor Dianne Cook, Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia

Dr. Emi Tanaka, Biological Data Science Institute, Australian National University, Canberra, Australia

Assistant Professor Susan VanderPlas, Statistics Department, University of Nebraska, Lincoln, USA

📜Literature of Regression Diagnostics

Graphical approaches (plots) are the recommended methods for diagnosing residuals.

  • Draper and Smith (1998) and Belsley, Kuh, and Welsch (1980):

Residual plots are usually revealing when the assumptions are violated.

  • Cook and Weisberg (1982):

Formal tests and graphical procedures are complementary and both have a place in residual analysis, but graphical methods are easier to use.

  • Montgomery and Peck (1982):

Residual plots are more informative in most practical situations than the corresponding conventional hypothesis tests.

🤔Challenges in Interpreting Residual Plots

What do you observe from this residual plot?

  • Vertical spread of the points varies with the fitted values.
  • This often indicates the existence of heteroskedasticity.

🤔Challenges in Interpreting Residual Plots

  • However, this is an over-interpretation.

  • The fitted model is correctly specified!

  • The triangle shape is caused by the skewed distribution of the regressors.

We need to use an inferential framework to calibrate the reading of residual plots!

🔬Visual Inference

This framework is called visual inference (Buja et al., 2009).

A lineup consists of \(m\) randomly placed plots:

  • one plot is the real residual plot (the data plot);
  • the remaining \(m - 1\) plots (null plots) contain residuals simulated from the fitted model.
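As a minimal sketch of how such a lineup could be assembled, the code below fits a simple linear model, simulates null residuals by drawing new responses from the fitted model and refitting, and hides the data plot at a random position. The function names (`ols_residuals`, `make_lineup`), the choice of \(m = 20\), and this particular simulation scheme are assumptions made for illustration, not necessarily the procedure used in the study.

```python
import numpy as np

def ols_residuals(X, y):
    """Return fitted values and residuals from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, y - fitted

def make_lineup(x, y, m=20, seed=0):
    """Build m (fitted, residual) panels: one data plot and m - 1 null plots.

    Hypothetical helper: null residuals are obtained by simulating new
    responses from the fitted model and refitting the same model.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    fitted, resid = ols_residuals(X, y)
    # Residual standard error of the fitted model
    sigma_hat = np.sqrt(resid @ resid / (n - X.shape[1]))
    data_pos = int(rng.integers(m))  # random position of the data plot
    panels = []
    for k in range(m):
        if k == data_pos:
            panels.append((fitted, resid))
        else:
            y_null = fitted + rng.normal(0.0, sigma_hat, size=n)
            _, null_resid = ols_residuals(X, y_null)
            panels.append((fitted, null_resid))
    return panels, data_pos
```

Each panel can then be drawn as a scatter plot of residuals against fitted values; an observer who picks the panel at `data_pos` has detected a difference between the data plot and the null plots.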

Can you now identify the real residual plot?

  • It is not uncommon for residual plots (No. 11) of this model to exhibit a triangle shape.

  • The visual discovery is calibrated via comparison.

🧪Experimental Design

To understand why regression experts consistently recommend plotting residuals, we conducted an experiment to compare conventional hypothesis testing with visual testing in linear regression diagnostics.

🧪Experimental Design

Non-linearity model:

\[\boldsymbol{y} = \boldsymbol{1}_n + \boldsymbol{x} + \boldsymbol{z} + \boldsymbol{\varepsilon},~ \boldsymbol{z} \propto He_j(\boldsymbol{x}) \text{ and } \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n),\]

where \(\boldsymbol{y}\), \(\boldsymbol{x}\), \(\boldsymbol{\varepsilon}\) are vectors of size \(n\), \(\boldsymbol{1}_n\) is a vector of ones of size \(n\), and \(He_{j}(\cdot)\) is the \(j\)th-order probabilist’s Hermite polynomial.

Null regression model:

\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x} + \boldsymbol{u}, ~\boldsymbol{u} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
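A minimal sketch of simulating from the non-linearity model and fitting the null model might look as follows. It relies on NumPy's `hermeval`, which evaluates probabilist's Hermite series. The standard normal distribution for \(\boldsymbol{x}\), the proportionality constant `scale`, and the helper names are assumptions for illustration only.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermeval

def simulate_nonlinear(n=100, j=2, sigma=1.0, scale=1.0, seed=0):
    """Simulate y = 1 + x + z + eps with z proportional to He_j(x)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)          # assumed distribution of x
    coef = np.zeros(j + 1)
    coef[j] = 1.0                   # coefficient vector selecting He_j
    z = scale * hermeval(x, coef)   # `scale` is a hypothetical constant
    eps = rng.normal(0.0, sigma, size=n)
    y = 1.0 + x + z + eps
    return x, y, z

def null_fit_residuals(x, y):
    """Residuals from the null model y = b0 + b1 * x + u, fitted by OLS."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta
```

For example, `j=2` gives \(He_2(x) = x^2 - 1\), so the residuals of the null fit carry a U-shaped pattern whose visibility depends on `sigma` and `scale`.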

🧪Experimental Design

Heteroskedasticity model:

\[\boldsymbol{y} = \boldsymbol{1}_n + \boldsymbol{x} + \boldsymbol{\varepsilon},~ \boldsymbol{\varepsilon} \sim N(\boldsymbol{0}_n, 1 + (2 - |a|)(\boldsymbol{x} - a)^2b \boldsymbol{I}_n),\]

where \(\boldsymbol{y}\), \(\boldsymbol{x}\), \(\boldsymbol{\varepsilon}\) are vectors of size \(n\), and \(\boldsymbol{1}_n\) is a vector of ones of size \(n\).

Null regression model:

\[\boldsymbol{y} = \beta_0 + \beta_1\boldsymbol{x} + \boldsymbol{u}, ~\boldsymbol{u} \sim N(\boldsymbol{0}_n, \sigma^2\boldsymbol{I}_n).\]
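The heteroskedasticity model can be sketched in the same way. Here the variance expression is read element-wise, so observation \(i\) has error variance \(1 + (2 - |a|)(x_i - a)^2 b\). The uniform distribution for \(\boldsymbol{x}\), the default values of `a` and `b`, and the function name are assumptions for illustration.

```python
import numpy as np

def simulate_heteroskedastic(n=100, a=0.0, b=4.0, seed=0):
    """Simulate y = 1 + x + eps with Var(eps_i) = 1 + (2 - |a|)(x_i - a)^2 b."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)   # assumed distribution of x
    var = 1.0 + (2.0 - abs(a)) * (x - a) ** 2 * b
    eps = rng.normal(size=n) * np.sqrt(var)  # scale by the standard deviation
    y = 1.0 + x + eps
    return x, y, var
```

Under this reading, `b = 0` recovers constant variance, and `a` sets the value of \(x\) at which the error variance is smallest.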

📏Effect size

We have chosen to use an approach based on Kullback-Leibler divergence (Kullback and Leibler, 1951).

The effect size is defined as

\[\begin{align*} E &= \frac{1}{2}\left(\log\frac{|\text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R}')|}{|\text{diag}(\boldsymbol{R}\widehat{\sigma}^2)|} - n + \text{tr}(\text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R}')^{-1}\text{diag}(\boldsymbol{R}\widehat{\sigma}^2)) + \boldsymbol{\mu}_z'(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R}')^{-1}\boldsymbol{\mu}_z\right), \\ \boldsymbol{\mu}_z &= \boldsymbol{R}\boldsymbol{Z},\\ \boldsymbol{R} &= \boldsymbol{I}_n - \boldsymbol{X}(\boldsymbol{X}'\boldsymbol{X})^{-1}\boldsymbol{X}', \end{align*}\]

where \(\text{diag}(\cdot)\) is the diagonal matrix constructed from the diagonal elements of a matrix, and \(\boldsymbol{V}\) is the actual covariance matrix of the error term.
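A minimal numerical sketch of this effect size is given below. One assumption is made loudly: because \(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R}'\) is singular (\(\boldsymbol{R}\) is a projection matrix of rank less than \(n\)), the sketch uses \(\text{diag}(\boldsymbol{R}\boldsymbol{V}\boldsymbol{R}')\) in the quadratic term as well, which keeps it consistent with the other terms. The function name and argument names are hypothetical.

```python
import numpy as np

def effect_size(X, V, z, sigma2):
    """KL-divergence-based effect size, using diagonal covariance matrices.

    X      : (n, p) design matrix of the null model
    V      : (n, n) actual covariance matrix of the error term
    z      : (n,)   omitted term (zero vector if there is none)
    sigma2 : float  error variance under the null model
    ASSUMPTION: the quadratic term uses diag(R V R'), since R V R' itself
    is singular and cannot be inverted.
    """
    n = X.shape[0]
    R = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)  # residual operator
    d_alt = np.diag(R @ V @ R.T)   # diagonal of R V R'
    d_null = np.diag(R) * sigma2   # diagonal of R * sigma^2
    mu_z = R @ z
    log_det_ratio = np.sum(np.log(d_alt)) - np.sum(np.log(d_null))
    trace_term = np.sum(d_null / d_alt)
    quad_term = np.sum(mu_z ** 2 / d_alt)
    return 0.5 * (log_det_ratio - n + trace_term + quad_term)
```

When the null model is correct (\(\boldsymbol{V} = \sigma^2\boldsymbol{I}_n\) and no omitted term), the log-determinant, trace and quadratic terms cancel and the effect size is zero.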

💪Power of Visual Tests

We use logistic regression to estimate the power:

\[\Pr(\text{reject}~H_0|H_1,E) = \Lambda\left(\log\left(\frac{0.05}{0.95}\right) + \beta_1 E\right),\]

where \(\Lambda(\cdot)\) is the standard logistic function given as \(\Lambda(z) = \exp(z)/(1+\exp(z))\).

  • The effect size \(E\) is the only predictor.

  • The intercept is fixed to \(\log(0.05/0.95)\) so that \(\widehat{\Pr}(\text{reject}~H_0|H_1,E = 0) = 0.05\).
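The fixed intercept acts as an offset, leaving \(\beta_1\) as the only free parameter. The sketch below estimates it by maximum likelihood. Because the log-likelihood is concave in \(\beta_1\), its derivative (the score) is monotone, so a simple bisection on the score is enough. The function names, the bracketing interval, and the use of bisection rather than a standard GLM routine are choices made for this illustration.

```python
import numpy as np

OFFSET = np.log(0.05 / 0.95)  # fixed intercept

def logistic(z):
    """Numerically stable standard logistic function."""
    return 0.5 * (1.0 + np.tanh(0.5 * np.asarray(z, dtype=float)))

def power(E, beta1):
    """Estimated probability of rejecting H0 at effect size E."""
    return logistic(OFFSET + beta1 * np.asarray(E, dtype=float))

def fit_beta1(E, reject, lo=-20.0, hi=20.0, n_iter=100):
    """Maximum likelihood estimate of beta1 with the intercept held fixed.

    E      : effect sizes, one per test
    reject : 1 if H0 was rejected, 0 otherwise
    lo, hi : assumed bracket containing the estimate
    """
    E = np.asarray(E, dtype=float)
    reject = np.asarray(reject, dtype=float)

    def score(b):
        # Derivative of the log-likelihood with respect to beta1
        return np.sum((reject - power(E, b)) * E)

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

By construction, `power(0.0, beta1)` equals 0.05 for any `beta1`, matching the constraint on the intercept.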

🛠️Experimental Setup

Overall, we collected 7974 evaluations of 1152 unique lineups from 443 subjects recruited through a crowdsourcing platform called Prolific (Palan and Schitter, 2018).

Every subject was asked to:

  • Evaluate a block of 20 lineups.
  • Select one or more plots that are most different from the others.
  • Briefly explain their selections.

⚖️Main Results: Power Comparison of Conventional Tests and Visual Tests

⚖️Non-linearity Patterns

⚖️Heteroskedasticity Patterns

The visual test rejects less frequently than the conventional test, and (almost) only rejects when the conventional test does.

🌟An example of conventional tests being too sensitive

The data plot (No. 1), which has an extremely small effect size (\(\log_e(E) = -0.48\)), is indistinguishable from the other plots.

The non-linearity pattern is visually undetectable.

However, the RESET test rejects the null hypothesis with a very small \(p\text{-value} = 0.004\). In contrast, the \(p\text{-value}\) produced by the visual test is \(0.813\).

🧐Main Conclusions

  • Conventional tests are more sensitive to weak departures than visual tests.

  • Conventional tests often reject when departures are not visibly different from null residual plots.

  • Visual tests perform equally well regardless of the type of residual departures and remove any subjective arguments about whether a pattern is visible or not.

  • Regression experts are right: residual plots are indispensable for assessing model fit.

Thanks! Any questions?